import warnings
warnings.filterwarnings('ignore')
9: Summary statistics
In analyzing the New York City Motor Vehicle Collision dataset, several summary statistics can provide insight into the nature and impact of collisions across the city. These are the summary statistics we plan to explore for this dataset:
Average Number of Collisions Per Day: This statistic helps understand the daily frequency of collisions, providing a baseline for identifying days with unusually high or low numbers of incidents. It’s a key indicator of overall traffic safety.
Median Number of Persons Injured in Collisions: The median gives a better sense of the typical collision severity by showing the middle value of injuries in all reported collisions. It’s less influenced by extreme values than the mean, making it a reliable measure of typical outcomes.
Percentiles for Number of Fatalities in Collisions: Percentiles (such as the 90th, 95th, and 99th) for fatalities can help identify the severity distribution of the most lethal collisions. Understanding the tail of this distribution is crucial for targeted interventions on the most dangerous incidents.
Average Number of Pedestrians, Cyclists, and Motorists Involved in Collisions: Breaking down the average number of pedestrians, cyclists, and motorists involved in collisions can highlight which road users are most at risk. This can inform targeted safety campaigns or infrastructure improvements.
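The four statistics listed above can be sketched in a few lines of pandas. This is a minimal illustration, not the notebook's own method: the column names are taken from the dataset's numeric columns and from the 'date/time' column used for plotting, while the `sample` frame and the `collision_summary` helper are hypothetical and the values in `sample` are synthetic. It also assumes 'date/time' can be parsed by `pd.to_datetime`.

```python
import pandas as pd

# Tiny synthetic sample for illustration only; these are not real collision records.
sample = pd.DataFrame({
    "date/time": ["2021-01-01 08:00", "2021-01-01 17:30", "2021-01-02 09:15",
                  "2021-01-03 12:00", "2021-01-03 13:45", "2021-01-03 23:10"],
    "number_of_persons_injured": [0, 1, 0, 0, 3, 0],
    "number_of_persons_killed": [0, 0, 0, 0, 1, 0],
    "number_of_pedestrians_injured": [0, 1, 0, 0, 0, 0],
    "number_of_cyclist_injured": [0, 0, 0, 0, 1, 0],
    "number_of_motorist_injured": [0, 0, 0, 0, 2, 0],
})


def collision_summary(df: pd.DataFrame) -> dict:
    """Compute the four summary statistics described above (hypothetical helper)."""
    # Assumption: 'date/time' holds strings that pd.to_datetime can parse.
    timestamps = pd.to_datetime(df["date/time"])

    # Number of collisions on each calendar day that appears in the data.
    # Days with no recorded collision are absent, so they do not enter the average.
    daily_counts = timestamps.dt.date.value_counts()

    road_user_cols = [
        "number_of_pedestrians_injured",
        "number_of_cyclist_injured",
        "number_of_motorist_injured",
    ]
    return {
        "avg_collisions_per_day": float(daily_counts.mean()),
        "median_injured": float(df["number_of_persons_injured"].median()),
        # Series indexed by the requested quantile levels.
        "fatality_percentiles": df["number_of_persons_killed"].quantile([0.90, 0.95, 0.99]),
        # Mean injuries per collision, broken down by road-user type.
        "avg_injured_by_road_user": df[road_user_cols].mean(),
    }


summary = collision_summary(sample)
for name, value in summary.items():
    print(name, value, sep=":\n")
```

On the full dataset the same helper could be called on the loaded DataFrame, provided the 'date/time' column parses cleanly.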
import pandas as pd
# Load the dataset
data_path = 'output/datasets/dataset_cleaned.csv'
data = pd.read_csv(data_path)
# Display basic summary statistics for numerical columns
summary_stats = data.describe()
# Displaying the results
print(summary_stats)
latitude longitude number_of_persons_injured \
count 1.801703e+06 1.801703e+06 1.801687e+06
mean 4.072418e+01 -7.391972e+01 3.204896e-01
std 7.919794e-02 8.589505e-02 7.047761e-01
min 4.049895e+01 -7.425496e+01 0.000000e+00
25% 4.066813e+01 -7.397444e+01 0.000000e+00
50% 4.072079e+01 -7.392673e+01 0.000000e+00
75% 4.077008e+01 -7.386702e+01 0.000000e+00
max 4.091288e+01 -7.370055e+01 4.300000e+01
number_of_persons_killed number_of_pedestrians_injured \
count 1.801675e+06 1.801703e+06
mean 1.501381e-03 6.008815e-02
std 4.089376e-02 2.512263e-01
min 0.000000e+00 0.000000e+00
25% 0.000000e+00 0.000000e+00
50% 0.000000e+00 0.000000e+00
75% 0.000000e+00 0.000000e+00
max 8.000000e+00 2.700000e+01
number_of_pedestrians_killed number_of_cyclist_injured \
count 1.801703e+06 1.801703e+06
mean 7.520662e-04 2.925676e-02
std 2.805393e-02 1.705714e-01
min 0.000000e+00 0.000000e+00
25% 0.000000e+00 0.000000e+00
50% 0.000000e+00 0.000000e+00
75% 0.000000e+00 0.000000e+00
max 6.000000e+00 4.000000e+00
number_of_cyclist_killed number_of_motorist_injured \
count 1.801703e+06 1.801703e+06
mean 1.221067e-04 2.265379e-01
std 1.109964e-02 6.639425e-01
min 0.000000e+00 0.000000e+00
25% 0.000000e+00 0.000000e+00
50% 0.000000e+00 0.000000e+00
75% 0.000000e+00 0.000000e+00
max 2.000000e+00 4.300000e+01
number_of_motorist_killed collision_id
count 1.801703e+06 1.801703e+06
mean 6.005429e-04 3.367892e+06
std 2.670972e-02 1.389642e+06
min 0.000000e+00 1.579000e+03
25% 0.000000e+00 3.272651e+06
50% 0.000000e+00 3.786513e+06
75% 0.000000e+00 4.273420e+06
max 4.000000e+00 4.767930e+06
mean_injuries = data['number_of_persons_injured'].mean()
median_injuries = data['number_of_persons_injured'].median()
print("Mean number of persons injured:", mean_injuries)
print("Median number of persons injured:", median_injuries)
Mean number of persons injured: 0.32048962999677527
Median number of persons injured: 0.0
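A median of 0 next to a mean of about 0.32 indicates that more than half of the collisions recorded no injuries, with a minority of collisions pulling the mean up. One way to quantify this is to compute the share of collisions with and without injuries. The sketch below uses a hypothetical `injury_share` helper and a small synthetic series, not the real data.

```python
import pandas as pd


def injury_share(injured: pd.Series) -> dict:
    """Share of collisions with and without injuries (hypothetical helper)."""
    # Drop missing values so they are not counted as zero-injury collisions.
    valid = injured.dropna()
    return {
        "share_without_injury": float((valid == 0).mean()),
        "share_with_injury": float((valid > 0).mean()),
        # Average injuries among collisions that had at least one injury.
        "mean_among_injury_collisions": float(valid[valid > 0].mean()),
    }


# Synthetic values for illustration only, including one missing entry.
example = pd.Series([0, 0, 0, 1, 0, 2, 0, 0, None, 0], name="number_of_persons_injured")
print(injury_share(example))
```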
import plotly.express as px
# Plotting the number of persons injured in each incident
fig = px.histogram(data, x='number_of_persons_injured', title='Distribution of Persons Injured per Incident')
fig.show()
# Plotting incidents over time, assuming 'date/time' is properly formatted and cleaned
fig_time = px.histogram(data, x='date/time', title='Distribution of Incidents Over Time')
fig_time.show()
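The time plot relies on 'date/time' being properly formatted. A minimal sketch of how that assumption could be checked is shown here: parse the column with `errors="coerce"`, count the entries that failed to parse, and aggregate the rest by month. The `parse_timestamps` helper and the `raw` series are hypothetical, and the values are synthetic.

```python
import pandas as pd


def parse_timestamps(raw: pd.Series) -> pd.Series:
    """Parse timestamps, turning unparseable entries into NaT instead of raising."""
    return pd.to_datetime(raw, errors="coerce")


# Synthetic values for illustration only, including one malformed entry.
raw = pd.Series(["2021-01-01 08:00", "2021-01-15 17:30", "not a date", "2021-02-03 09:15"])
parsed = parse_timestamps(raw)

# Number of entries that could not be parsed.
n_bad = int(parsed.isna().sum())

# Collisions per calendar month, in chronological order.
monthly = parsed.dropna().dt.to_period("M").value_counts().sort_index()

print("Unparseable timestamps:", n_bad)
print(monthly)
```

If the count of unparseable timestamps is not zero on the real data, those rows would need to be inspected before the histogram over time is interpreted.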